A NEW STATISTIC FOR EVALUATING ITEM RESPONSE THEORY MODELS 


FOR ORDINAL DATA 

Li Cai and Scott Monroe 
CRESST/ University of California, Los Angeles 

Abstract 

We propose a new limited-information goodness of fit test statistic $C_2$ for ordinal IRT
models. The construction of the new statistic lies formally between the $M_2$ statistic of
Maydeu-Olivares and Joe (2006), which utilizes first and second order marginal probabilities,
and the $M_2^*$ statistic of Cai and Hansen (2013), which collapses the marginal probabilities into
means and product moments. Unlike $M_2^*$, $C_2$ may be computed even when the number of
items is small and the number of categories is large. It is as well calibrated as the alternatives
and can be more powerful than $M_2$. When all items are dichotomous, $C_2$ becomes equivalent
to $M_2^*$, which is also equivalent to $M_2$. We analyze empirical data from a patient-reported
outcomes measurement development project to illustrate the potential differences in
substantive conclusions that one may draw from the use of different statistics for model fit
assessment.

Keywords: item response theory, goodness of fit, limited-information 

Introduction 

Recent years have witnessed an increased interest in the formal evaluation of item response
theory (IRT) models (Maydeu-Olivares, 2013). In particular, great technical strides have been
made in the area of limited-information fit statistics (Bartholomew & Leung, 2002;
Maydeu-Olivares & Joe, 2006; Joe & Maydeu-Olivares, 2010). In contrast to the classical
full-information statistics such as Pearson's $X^2$ statistic or the likelihood ratio statistic $G^2$, which
utilize full response pattern frequencies and residuals, these limited-information statistics are
based on observed and model-implied lower-order margins, e.g., first- and second order marginal
frequencies. As Thissen and Steinberg (1997) noted, for a handful of polytomous items, the
contingency table upon which the item response model is defined becomes extremely sparse. The
full-information test statistics do not approach the asymptotic chi-square reference distributions
under such sparseness (see e.g., Bartholomew & Tzamourani, 1999) and consequently have
limited utility for evaluating IRT models, particularly for ordinal data. On the other hand,
because the lower-order margins tend to be better filled than the full item response
cross-classifications, limited-information statistics not only retain better calibration than
full-information statistics under the null, but are also more powerful under the alternative (Joe &
Maydeu-Olivares, 2010). Maydeu-Olivares and Joe's (2006) $M_2$ and Cai and Hansen's (2013)
$M_2^*$ statistics are examples that have found their way into widely distributed software (Cai,
Thissen, & du Toit, 2011; Cai, 2013) and have begun to demonstrate their usefulness in
empirical measurement research that requires evaluating IRT model fit for ordinal data.

Technical and practical challenges remain, however, and we submit that both the original
$M_2$ statistic, which is based on uncollapsed first- and second order marginal residuals, and the
Cai-Hansen updated $M_2^*$ statistic, which utilizes a further condensing/collapsing of the first- and
second order marginal residuals into residual moments, have limitations that result in diminished
practical utility for IRT models fitted to ordinal data. The original $M_2$ statistic suffers from a
more subtle sparseness issue that Cai and Hansen (2013) discussed. For example, suppose two
ordinal items each with 5 categories (perhaps on a Likert-type scale) both load strongly on the
same latent variable(s). Then, the observed item responses will tend to covary. Respondents who
endorse the extreme response categories for item 1 tend to have similar responses for item 2. By
virtue of the shared underlying latent variable(s), certain cells in the bivariate contingency table
will have very small expected frequencies, e.g., the combination of the most positive response
option on item 1 and the most negative response option on item 2. The number of cells that may
be sparse grows with the number of categories, eventually leading to a
breakdown of the asymptotic chi-square approximation. Generally the statistic will be
stochastically smaller than the reference distribution, leading to lower than nominal Type I error
rates under the null, and a loss of statistical power under the alternative. Cai and Hansen (2013)
proposed $M_2^*$ as a remedy because it uses conventional item scores (0-1-2-3-...) assigned to
ordinal categories to compute residual moments from the first- and second order margins. Given
our 2-item 5-category example from above, the original $M_2$ would require $2 \times (5 - 1) = 8$ first
order expected marginal probabilities, and $(5 - 1) \times (5 - 1) = 16$ second order expected
marginal probabilities.¹ For $M_2^*$, the 8 first order marginal expected probabilities are used to
compute 2 first order moments, and the 16 second order expected marginal probabilities
collapse into a single second order moment. This further collapsing guarantees that sparseness
no longer affects $M_2^*$, and the statistic is shown to be well-calibrated and powerful in Cai and
Hansen's (2013) simulations.

A major issue still remains: $M_2^*$ appears to have collapsed the contingency table far too
aggressively. For assessments made up of items with 5 ordered response categories (very popular
in social and behavioral sciences research), it would take at least 10 items for $M_2^*$ to begin to
have positive degrees of freedom, even for a simple unidimensional graded response model. For
9 items, there are $9 + 9 \times (9 - 1)/2 = 45$ first- and second order residual moments, but a
unidimensional graded response model for 5 categories also has 45 parameters, leaving zero
degrees of freedom for model fit testing with $M_2^*$. In this case, the IRT model is not locally
identified from the set of marginal residual moments. We note that, for instance, in the
patient-reported outcomes measurement context (Hansen, Cai, Stucky, Tucker, Shadel, & Edelen, in
press), most short forms contain fewer than 10 items and the items tend to have ordinal response
formats. Thus, $M_2^*$ cannot be used to evaluate model fit for such short form measures. This
clearly limits the utility of $M_2^*$.

¹ The number of first order marginal probabilities per item is equal to 4 because the 5 probabilities must sum to 1.0
and there are only 4 independent probabilities. The same argument applies to the bivariate table.

However, a closer examination of the logic of the further collapsing used in $M_2^*$ shows that
the situations for the first order margins and the second order margins are in fact very different. 
Typically, the first order margins are adequately filled, (mostly) as a result of standard operating 
procedures in routine item analysis. If a response category is endorsed by very few respondents, 
either the item is removed altogether, or the categories are collapsed before model-fitting 
commences. In other words, in practice, sparseness of the second order margins is generally not 
accompanied by sparseness of the first order margins. 

Therefore we arrive at the dominating insight of this research: The first order margins 
should not be collapsed into moments, but the second order margins should. We propose a new 
test statistic that stands between the original $M_2$, which does not collapse marginal residuals, and
$M_2^*$, which further collapses the marginal residuals into moments. The new statistic still relies on
first and second order information, but only the second order margins are collapsed into
moments. This new statistic remedies the weaknesses of $M_2$ and $M_2^*$. We call it $C_2$.

With $C_2$, Samejima's (1969) unidimensional graded response (GR) model or Muraki's
(1992) unidimensional generalized partial credit (GPC) model is locally identified (and has
positive degrees of freedom) for as few as 4 items. More generally, unlike with $M_2^*$, the ability to
compute $C_2$ does not depend on the number of categories per item. We show that $C_2$ is as well
calibrated as the competition (namely $M_2$ and $M_2^*$), and can be more powerful. Finally, for $C_2$, the
structure of the first and second order margins has an appealing connection to the parameters of 
an IRT model. The uncollapsed raw first order margins are strongly related to the item location 
parameters, while the collapsed second order margins (i.e., moments) are essentially covariances 
and are directly related to the item discrimination/loading parameters. 

The remainder of the paper is organized as follows. We introduce basic notation in Section 
2, and discuss maximum marginal likelihood estimation in Section 3. Properties of multinomial 
residuals are demonstrated in Section 4 to facilitate the introduction of the proposed test statistic. 
In Section 5, we report results from a simulation study to examine the calibration and power of 
the new test statistic. In Section 6, empirical data from a patient-reported outcomes measurement 
development project will be used to further illustrate the new statistic. We conclude with a 
discussion of possible future research directions. 


Some Notation 

Let there be a total of $i = 1, \ldots, I$ items. For an item with $K_i$ ordered polytomous responses,
let the response categories be coded as $k = 0, \ldots, K_i - 1$. Let $\theta$ denote the underlying latent
variable, and $T_i(k|\theta)$ the category response function for item $i$ and category $k$. Without loss of
generality, let us consider a logistic version of Samejima's (1969) graded response model for the
remainder of this paper, noting that the theory developed in the sequel applies equally well to
other ordinal IRT models such as the GPC model. The graded model sets the cumulative
response function for item $i$ in categories $k$ and above as

$$T_i^+(k|\theta) = \frac{1}{1 + \exp[-(\alpha_{ik} + \beta_i \theta)]}, \tag{1}$$

for $k = 1, \ldots, K_i - 1$, where $\alpha_{ik}$ is the intercept (location) and $\beta_i$ is the slope (discrimination)
parameter. Let the boundary cases be $T_i^+(0|\theta) = 1$ and $T_i^+(K_i|\theta) = 0$. The category response
function can be written as

$$T_i(k|\theta) = T_i^+(k|\theta) - T_i^+(k+1|\theta), \tag{2}$$

for $k = 0, \ldots, K_i - 1$.
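
As an illustration only, Equations (1) and (2) can be evaluated numerically as sketched below. The function name `graded_category_probs` and its argument layout are our own assumptions for this sketch; they are not taken from any particular IRT software.

```python
import numpy as np

def graded_category_probs(theta, alpha, beta):
    """Category probabilities T_i(k | theta), k = 0..K_i-1, for a single item.

    theta : scalar or 1-D array of latent variable values
    alpha : the K_i - 1 intercepts alpha_{i1}, ..., alpha_{i,K_i-1}
            (assumed to be in decreasing order)
    beta  : the slope beta_i
    Returns an array of shape (len(theta), K_i).
    """
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    alpha = np.asarray(alpha, dtype=float)
    # Equation (1): cumulative response functions T+(k | theta), k = 1..K_i-1.
    cum = 1.0 / (1.0 + np.exp(-(alpha[None, :] + beta * theta[:, None])))
    # Boundary cases T+(0 | theta) = 1 and T+(K_i | theta) = 0.
    ones = np.ones((theta.size, 1))
    zeros = np.zeros((theta.size, 1))
    cum = np.hstack([ones, cum, zeros])
    # Equation (2): differences of adjacent cumulative response functions.
    return cum[:, :-1] - cum[:, 1:]

# Hypothetical 4-category item evaluated at three values of theta.
probs = graded_category_probs([-1.0, 0.0, 2.0], [1.0, 0.0, -1.0], 1.5)
print(probs.round(3))
```

Each row of the returned array sums to one, which follows directly from the two boundary cases.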


Let $Y_i$ be a random variable whose realization $y_i$ is a response to item $i$. The probability
mass function of $Y_i$, conditional on $\theta$, is that of a multinomial with trial size 1:

$$P(Y_i = y_i | \theta; \gamma) = \prod_{k=0}^{K_i - 1} T_i(k|\theta)^{1_k(y_i)}, \tag{3}$$

where $1_k(y_i)$ is an indicator function such that

$$1_k(y_i) = \begin{cases} 1, & \text{if } k = y_i \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

and $\gamma$ collects together all item parameters. Let the dimensionality of $\gamma$ be equal to $d$, which is
the number of free and unconstrained parameters in the model.

Under the assumption of conditional independence (Lord, 1968), the conditional
probability for the response pattern $y = (y_1, \ldots, y_I)'$ factors into a product:

$$\pi_y(\theta; \gamma) = P\left(\bigcap_{i=1}^{I} Y_i = y_i \,\Big|\, \theta; \gamma\right) = \prod_{i=1}^{I} P(Y_i = y_i | \theta; \gamma). \tag{5}$$
Assuming that the latent variable distribution has density $g(\theta)$, typically standard normal in
applications, the marginal probability of the response pattern is

$$\pi_y(\gamma) = \int \prod_{i=1}^{I} P(Y_i = y_i | \theta; \gamma)\, g(\theta)\, d\theta. \tag{6}$$

Recall that $K_i$ is the number of categories for item $i$. For $I$ items, the IRT model generates a
total of $K = \prod_{i=1}^{I} K_i$ cross-classifications or possible item response patterns in the form of a
contingency table. For example, with 2 dichotomous items, the 4 possible response patterns are
(0,0), (0,1), (1,0), (1,1), in reverse lexicographical order. Note that $K$ may become very large
for polytomous items, e.g., for ten 5-category items, $K$ is just under 10 million.
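
The enumeration of response patterns and the integral in Equation (6) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the item parameters are hypothetical, the helper names are ours, and the integral over a standard normal $\theta$ is approximated on an equally spaced grid rather than by any particular quadrature rule used in production software.

```python
import itertools
import numpy as np

def category_probs(theta, alpha, beta):
    """Graded-model category probabilities; shape (len(theta), K_i)."""
    alpha = np.asarray(alpha, dtype=float)
    cum = 1.0 / (1.0 + np.exp(-(alpha[None, :] + beta * theta[:, None])))
    cum = np.hstack([np.ones((theta.size, 1)), cum, np.zeros((theta.size, 1))])
    return cum[:, :-1] - cum[:, 1:]

def pattern_probabilities(items, n_nodes=61):
    """Return all response patterns and their marginal probabilities.

    items : list of (alpha, beta) pairs, one per item (hypothetical container).
    """
    # Grid approximation to the standard normal density g(theta).
    nodes = np.linspace(-6.0, 6.0, n_nodes)
    weights = np.exp(-0.5 * nodes**2)
    weights /= weights.sum()
    probs = [category_probs(nodes, alpha, beta) for alpha, beta in items]
    # itertools.product varies the last item fastest: (0,0,0), (0,0,1), ...
    patterns = list(itertools.product(*[range(p.shape[1]) for p in probs]))
    pi = np.array([
        np.sum(weights * np.prod([probs[i][:, y] for i, y in enumerate(pat)], axis=0))
        for pat in patterns
    ])
    return patterns, pi

# Hypothetical parameters: one dichotomous item and two 3-category items.
items = [([0.0], 1.0), ([1.0, -1.0], 1.2), ([0.5, -0.5], 0.8)]
patterns, pi = pattern_probabilities(items)
print(len(patterns), patterns[:4], pi.sum())
```

The pattern ordering produced here, with the last item changing fastest, matches the ordering (0,0), (0,1), (1,0), (1,1) used in the text for two dichotomous items.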

Maximum Likelihood Estimation of IRT Models 


Based on a sample of $N$ respondents, let the observed proportion of individuals with
response pattern $y$ be denoted as $p_y$. These observed proportions can be collected into a $K \times 1$
vector $p$. Correspondingly, the $K$ model-implied probabilities $\pi_y(\gamma)$ can be collected into a
$K \times 1$ vector $\pi(\gamma)$. We recognize that the IRT model is a parametric structural model. The model-implied
probability vector $\pi(\gamma)$ imposes a specific moment structure on $K - 1$ independent probabilities
(as the sum of the $K$ probabilities must be 1) with $d$ parameters.

Suppose there is a $K \times 1$ vector $\pi_0$ containing the true (population) response pattern
probabilities. If the IRT model is exactly correctly specified, i.e., it fits perfectly in the
population, then there exists a parameter vector $\gamma_0$ such that $\pi(\gamma_0) = \pi_0$. The elements of $\gamma_0$
may be taken as the true parameters. When this is the case, parameter estimation is
straightforward.

The sampling model for this contingency table is that of a multinomial with $K$ cells and $N$
trials (Reiser, 1996). The log-likelihood for the item parameters $\gamma$ is proportional to

$$\log L(\gamma) \propto \sum_{y} p_y \log \pi_y(\gamma), \tag{7}$$

where the summation is nominally over all $K$ response patterns. In reality, when a particular
pattern is not observed in the data, the corresponding $p_y$ is zero and the term does not contribute
to the log-likelihood. Maximization of $\log L(\gamma)$, e.g., with Bock and Aitkin's (1981) EM
algorithm, leads to the maximum marginal likelihood estimator $\hat{\gamma}$.

It is a standard result from discrete multivariate analysis (e.g., Bishop, Fienberg, &
Holland, 1975) that the maximum likelihood estimator is $\sqrt{N}$-consistent, asymptotically normal
and asymptotically efficient under correct model specification. In other words, we have

$$\sqrt{N}(\hat{\gamma} - \gamma_0) \xrightarrow{d} \mathcal{N}_d(0, \mathcal{F}_0^{-1}), \tag{8}$$

where $\mathcal{F}_0 = \Delta_0' [\mathrm{diag}(\pi_0)]^{-1} \Delta_0$ is the $d \times d$ Fisher information matrix evaluated at the true
parameter values, and the Jacobian $\Delta_0$ is a $K \times d$ matrix of all first order partial derivatives of
the response pattern probabilities with respect to the parameters, evaluated at $\gamma_0$:

$$\Delta_0 = \frac{\partial \pi(\gamma_0)}{\partial \gamma'}.$$
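Furthermore, let $\hat{\pi}_y = \pi_y(\hat{\gamma})$ denote the model-implied probability for response pattern $y$
under maximum likelihood estimation. The direct comparison between $\hat{\pi}_y$ and $p_y$ leads to
classical full-information fit statistics such as the likelihood ratio $G^2$ and Pearson's $X^2$:

$$G^2 = 2N \sum_{y} p_y \log \frac{p_y}{\hat{\pi}_y}, \qquad X^2 = N \sum_{y} \frac{(p_y - \hat{\pi}_y)^2}{\hat{\pi}_y}.$$
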
Under the null hypothesis that the IRT model fits exactly, these two statistics are
asymptotically distributed as central chi-square variables with degrees of freedom equal to
$K - 1 - d$, against the general multinomial alternative (Bishop et al., 1975). Unfortunately for IRT
models, K is exponential in the number of items. As argued earlier, when K is large, the expected 
response pattern probabilities necessarily become small, resulting in an extremely sparse table. 
This sparseness invalidates the chi-square approximation and renders these full-information 
statistics unsuitable for model fit testing (Cochran, 1952). 
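
For concreteness, the two full-information statistics and their degrees of freedom can be computed as in the sketch below. It uses the standard textbook definitions of $G^2$ and $X^2$ for a multinomial table; the function names and the toy inputs are assumptions made for this illustration.

```python
import numpy as np

def full_information_statistics(p, pi_hat, n):
    """Return (G2, X2) from observed proportions p and model probabilities pi_hat."""
    p = np.asarray(p, dtype=float)
    pi_hat = np.asarray(pi_hat, dtype=float)
    seen = p > 0  # unobserved patterns contribute nothing to G2
    g2 = 2.0 * n * np.sum(p[seen] * np.log(p[seen] / pi_hat[seen]))
    x2 = n * np.sum((p - pi_hat) ** 2 / pi_hat)
    return g2, x2

def full_table_df(n_categories, d):
    """Degrees of freedom K - 1 - d, where K is the product of category counts."""
    return int(np.prod(n_categories)) - 1 - d

# Toy two-cell table with N = 100 (hypothetical numbers).
g2, x2 = full_information_statistics([0.5, 0.5], [0.25, 0.75], 100)
print(g2, x2)

# Ten 5-category items under a graded model with d = 50 parameters.
print(full_table_df([5] * 10, 50))
```

The second call makes the sparseness argument tangible: the reference distribution would have millions of degrees of freedom, far more than any realistic sample size can fill.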

Limited-information Goodness-of-fit Testing 
Distribution of Multinomial Residuals under Maximum Likelihood Estimation 


It can be shown that the asymptotic distribution of $(p - \pi_0)$ is $K$-variate normal:

$$\sqrt{N}(p - \pi_0) \xrightarrow{d} \mathcal{N}_K(0, \Sigma_0), \tag{10}$$

where $\Sigma_0 = \mathrm{diag}(\pi_0) - \pi_0 \pi_0'$ is the population multinomial covariance matrix. Recall that
$\hat{\pi}_y = \pi_y(\hat{\gamma})$ is the model-implied probability for response pattern $y$ under maximum likelihood
estimation. The $K$ model-implied probabilities may be collected into a $K \times 1$ vector $\hat{\pi} = \pi(\hat{\gamma})$. It
can also be shown that the residual vector $(\hat{\pi} - p)$ is asymptotically $K$-variate normally
distributed under maximum likelihood estimation, albeit with a different limiting covariance
matrix to take maximum likelihood estimation of parameters into account,

$$\sqrt{N}(\hat{\pi} - p) \xrightarrow{d} \mathcal{N}_K(0, \Xi_0), \tag{11}$$

where $\Xi_0 = \Sigma_0 - \Delta_0 \mathcal{F}_0^{-1} \Delta_0'$ (Bishop et al., 1975).
For example, for a test made up of 3 items, where item 1 is dichotomous and items 2 and 3
have 3 categories each, there are 18 possible item response patterns. In reverse lexicographical
order, the model-implied response pattern probabilities and observed proportions are:

$$
\begin{aligned}
\hat{\pi} = \big(&\hat{\pi}(000), \hat{\pi}(001), \hat{\pi}(002), \hat{\pi}(010), \hat{\pi}(011), \hat{\pi}(012), \hat{\pi}(020), \hat{\pi}(021), \hat{\pi}(022), \\
&\hat{\pi}(100), \hat{\pi}(101), \hat{\pi}(102), \hat{\pi}(110), \hat{\pi}(111), \hat{\pi}(112), \hat{\pi}(120), \hat{\pi}(121), \hat{\pi}(122)\big)', \\
p = \big(&p(000), p(001), p(002), p(010), p(011), p(012), p(020), p(021), p(022), \\
&p(100), p(101), p(102), p(110), p(111), p(112), p(120), p(121), p(122)\big)'.
\end{aligned}
\tag{12}
$$



First Order Margins 

Using this arrangement, marginal probabilities can be obtained as linear functions of $\hat{\pi}$ and
$p$. Consider the 3-item example from above. There are 5 independent first order marginal
probabilities: 1 from item 1, which is dichotomous, and 2 each from items 2 and 3. In general, for $I$
items, there are $q_1 = \sum_{i=1}^{I} (K_i - 1)$ independent first order marginal probabilities. Without loss
of generality and by convention, we can obtain an independent set of marginal probabilities for
item $i$ by removing the marginal probability for the lowest category with code 0. These first
order marginal probabilities can be obtained from the full $K$-dimensional probability vector using
a $q_1 \times K$ reduction operator matrix (see e.g., Joe & Maydeu-Olivares, 2010). The first order
reduction matrix $L_1$ is a fixed incidence matrix that contains zeroes and ones, where the ones
serve to select and sum over those full response pattern probabilities that correspond to a
particular item and a particular category code to yield the desired marginal probability.
Importantly, this matrix has full row rank. An example is given below:

$$
\hat{\pi}_1 =
\begin{pmatrix} \hat{\pi}_1^{(1)} \\ \hat{\pi}_2^{(1)} \\ \hat{\pi}_2^{(2)} \\ \hat{\pi}_3^{(1)} \\ \hat{\pi}_3^{(2)} \end{pmatrix}
= L_1 \hat{\pi} =
\begin{pmatrix}
0&0&0&0&0&0&0&0&0&1&1&1&1&1&1&1&1&1\\
0&0&0&1&1&1&0&0&0&0&0&0&1&1&1&0&0&0\\
0&0&0&0&0&0&1&1&1&0&0&0&0&0&0&1&1&1\\
0&1&0&0&1&0&0&1&0&0&1&0&0&1&0&0&1&0\\
0&0&1&0&0&1&0&0&1&0&0&1&0&0&1&0&0&1
\end{pmatrix}
\begin{pmatrix} \hat{\pi}(000) \\ \hat{\pi}(001) \\ \vdots \\ \hat{\pi}(122) \end{pmatrix},
\tag{13}
$$



where $\hat{\pi}_1$ denotes the $q_1$-vector of all independent first order marginal probabilities, and $\hat{\pi}_i^{(k)}$
denotes the first order marginal probability for item $i$ in category $k$. By analogy, $p_1 = L_1 p$ is a
$q_1$-vector of independent first order observed proportions. The elements of $p_1$ may be denoted
$p_i^{(k)}$. Comparing $\hat{\pi}_1$ and $p_1$ can show the extent to which the IRT model has successfully
reproduced the first order proportions.
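
The first order reduction matrix can be generated mechanically from the list of response patterns. The following is a small sketch under our own naming (`first_order_reduction` is a hypothetical helper); it reproduces the incidence matrix of the 3-item example with 2, 3, and 3 categories.

```python
import itertools
import numpy as np

def first_order_reduction(n_categories):
    """Build the q1 x K incidence matrix and the list of response patterns."""
    # Patterns are ordered with the last item changing fastest.
    patterns = list(itertools.product(*[range(k) for k in n_categories]))
    rows = []
    for item, k_i in enumerate(n_categories):
        for cat in range(1, k_i):  # category 0 is dropped by convention
            rows.append([int(pat[item] == cat) for pat in patterns])
    return np.array(rows), patterns

L1, patterns = first_order_reduction([2, 3, 3])
print(L1.shape)
print(L1)
```

Multiplying `L1` into a vector of pattern probabilities (or observed proportions) then yields the five independent first order margins.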


On the other hand, Cai and Hansen (2013) considered collapsing the marginal probabilities
into marginal moments (see also Joe & Maydeu-Olivares, 2010). Cai and Hansen (2013)
reasoned that for each ordinal item, one could use the usual category codes to compute an item
mean from both the model-implied and the observed first order marginal probabilities:

$$\hat{\mu}_i = \sum_{k=0}^{K_i - 1} k\, \hat{\pi}_i^{(k)}, \qquad m_i = \sum_{k=0}^{K_i - 1} k\, p_i^{(k)}. \tag{14}$$

This setup has the side benefit of effectively eliminating the first category for each item so that
no special treatment is required to obtain independent probabilities. They also showed that
item means can be computed via reduction operator matrices. In general one only
has to pre-multiply $L_1$ by an $I \times q_1$ block-diagonal matrix $R_1$. The $I$ diagonal blocks of $R_1$ are row
vectors made up of item category codes (sans 0): $r_i = (1, \ldots, K_i - 1)$. The reduction operator
matrix $L_1^* = R_1 L_1$ is of order $I \times K$. Because $R_1$ has full row rank, $L_1^*$ has full row rank as well. An
example of item mean computation for the 3-item test from above is shown below:

$$
\hat{\boldsymbol{\mu}}_1 =
\begin{pmatrix} \hat{\mu}_1 \\ \hat{\mu}_2 \\ \hat{\mu}_3 \end{pmatrix}
= R_1 \hat{\pi}_1 =
\begin{pmatrix} r_1 & & \\ & r_2 & \\ & & r_3 \end{pmatrix} \hat{\pi}_1 =
\begin{pmatrix} 1&0&0&0&0\\ 0&1&2&0&0\\ 0&0&0&1&2 \end{pmatrix} \hat{\pi}_1
= (R_1 L_1)\hat{\pi} = L_1^* \hat{\pi},
\tag{15}
$$

where $\hat{\boldsymbol{\mu}}_1$ is a vector containing the $I$ model-implied item means. By analogy, $\mathbf{m}_1 = L_1^* p$ is a
vector of observed item means. The model-implied and observed means can be compared just as
in model fit assessment for mean and covariance structure models.
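
The block-diagonal scoring matrix for item means is equally easy to construct. In the sketch below, `mean_scoring_matrix` is a hypothetical helper name and the first order margins are made-up numbers chosen only to show the arithmetic.

```python
import numpy as np

def mean_scoring_matrix(n_categories):
    """Block-diagonal matrix whose i-th row holds the codes (1, ..., K_i - 1)."""
    q1 = sum(k - 1 for k in n_categories)
    scoring = np.zeros((len(n_categories), q1), dtype=int)
    col = 0
    for item, k_i in enumerate(n_categories):
        scoring[item, col:col + k_i - 1] = np.arange(1, k_i)
        col += k_i - 1
    return scoring

R1 = mean_scoring_matrix([2, 3, 3])
print(R1)

# Hypothetical first order margins for the 3-item example, in the order
# (item 1 cat 1, item 2 cat 1, item 2 cat 2, item 3 cat 1, item 3 cat 2).
pi_1 = np.array([0.4, 0.3, 0.2, 0.5, 0.1])
item_means = R1 @ pi_1
print(item_means)
```

For instance, the mean of item 2 is $1 \times 0.3 + 2 \times 0.2 = 0.7$; category 0 never enters because its code is zero.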


Second Order Margins 


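Table 1

Bivariate Table of Marginal Probabilities for Item Pair (2,3) from the Example

                                      Item 2 Category Code
Item 3 Category Code        0                        1                        2                        Marginal Probability for Item 3

0                           $\hat{\pi}_{32}^{(00)}$  $\hat{\pi}_{32}^{(01)}$  $\hat{\pi}_{32}^{(02)}$  $\hat{\pi}_3^{(0)}$
1                           $\hat{\pi}_{32}^{(10)}$  $\hat{\pi}_{32}^{(11)}$  $\hat{\pi}_{32}^{(12)}$  $\hat{\pi}_3^{(1)}$
2                           $\hat{\pi}_{32}^{(20)}$  $\hat{\pi}_{32}^{(21)}$  $\hat{\pi}_{32}^{(22)}$  $\hat{\pi}_3^{(2)}$

Marginal Probability
for Item 2                  $\hat{\pi}_2^{(0)}$      $\hat{\pi}_2^{(1)}$      $\hat{\pi}_2^{(2)}$      1.0
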

Generalizing the notation from first order marginal probabilities, let $\hat{\pi}_{ij}^{(kl)}$ denote the
second order marginal probability for item pair $(i, j)$ where item $i$ is in category $k$ and item $j$ is
in category $l$. With $I$ items, there are $I(I - 1)/2$ item pairs for $1 \le j < i \le I$. For each pair,
these second order marginal probabilities form a $K_i \times K_j$ contingency table. Table 1 presents an
example using the 3-item test from above. Each cell of the table corresponds to a second order
marginal probability. On the margins of the table are the first order probabilities. Locally in this
two-way table, given the first order margins, there are only $(K_i - 1) \times (K_j - 1)$ independent
second order probabilities. The cells in which either item takes category code 0 (the first row and
first column of Table 1) indicate joint probabilities that we routinely remove
to obtain independent probabilities. By this point it should come as no surprise that these
second order marginal probabilities can be obtained via reduction operator matrices. These
reduction matrices are also fixed incidence matrices that contain zeroes and ones only. Each row of
a reduction matrix sums over the response pattern probabilities corresponding to a particular
second order marginal probability. As such a reduction operator matrix has full row rank and the
number of rows is equal to $q_2 = \sum_{1 \le j < i \le I} (K_i - 1) \times (K_j - 1)$. The number of columns is equal
to $K$. An example is shown below:


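
$$
\hat{\pi}_2 =
\begin{pmatrix}
\hat{\pi}_{21}^{(11)} \\ \hat{\pi}_{21}^{(21)} \\ \hat{\pi}_{31}^{(11)} \\ \hat{\pi}_{31}^{(21)} \\
\hat{\pi}_{32}^{(11)} \\ \hat{\pi}_{32}^{(12)} \\ \hat{\pi}_{32}^{(21)} \\ \hat{\pi}_{32}^{(22)}
\end{pmatrix}
= L_2 \hat{\pi} =
\begin{pmatrix}
0&0&0&0&0&0&0&0&0&0&0&0&1&1&1&0&0&0\\
0&0&0&0&0&0&0&0&0&0&0&0&0&0&0&1&1&1\\
0&0&0&0&0&0&0&0&0&0&1&0&0&1&0&0&1&0\\
0&0&0&0&0&0&0&0&0&0&0&1&0&0&1&0&0&1\\
0&0&0&0&1&0&0&0&0&0&0&0&0&1&0&0&0&0\\
0&0&0&0&0&0&0&1&0&0&0&0&0&0&0&0&1&0\\
0&0&0&0&0&1&0&0&0&0&0&0&0&0&1&0&0&0\\
0&0&0&0&0&0&0&0&1&0&0&0&0&0&0&0&0&1
\end{pmatrix}
\hat{\pi},
\tag{16}
$$
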


where $L_2$ is the $q_2 \times K$ (in this case $8 \times 18$) reduction matrix, and $\hat{\pi}_2$ denotes the $q_2$-vector of all
independent second order marginal probabilities. Again by analogy, $p_2 = L_2 p$ is a $q_2$-vector of
independent second order observed proportions. The elements of $p_2$ may be denoted $p_{ij}^{(kl)}$.
Comparison of $\hat{\pi}_2$ and $p_2$ tells us how well the IRT model fits the second order proportions.
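
The second order reduction matrix follows the same recipe as the first order one. A minimal sketch (with `second_order_reduction` as a hypothetical helper name) that regenerates the $8 \times 18$ matrix of the 3-item example:

```python
import itertools
import numpy as np

def second_order_reduction(n_categories):
    """Build the q2 x K incidence matrix of independent second order margins."""
    patterns = list(itertools.product(*[range(k) for k in n_categories]))
    rows = []
    # Item pairs (i, j) with j < i, in the order (2,1), (3,1), (3,2), ...
    for i in range(1, len(n_categories)):
        for j in range(i):
            # Category 0 of either item is dropped.
            for k in range(1, n_categories[i]):
                for l in range(1, n_categories[j]):
                    rows.append([int(pat[i] == k and pat[j] == l) for pat in patterns])
    return np.array(rows)

L2 = second_order_reduction([2, 3, 3])
print(L2.shape)
print(L2)
```

Each row has ones exactly in the columns of those response patterns in which item $i$ is in category $k$ and item $j$ is in category $l$.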


Returning to the example in Table 1, Cai and Hansen (2013) noted that if there is reason to 
believe that the items in a test are strongly influenced by a common latent variable, as is typically 
the case due to common assessment development practices in educational and psychological 
testing, the second order marginal probabilities for the “inconsistent” response patterns in the 
two-way table will necessarily become small. For example, if both items 2 and 3 provide 
evidence about the respondents’ severity in depression symptoms, then in aggregate, 
endorsement of a category indicating high severity on item 2 will tend to be correlated with 
endorsement of a similar category on item 3. Thus the cells in Table 1 that are close to the main 
diagonal (i.e., the consistent response patterns) will be better filled than the cells that are far 
removed from the diagonal (i.e., the inconsistent response patterns). As the number of categories 
increases, the sparseness of the second order margins becomes increasingly severe. Some of the 
observed second order marginal proportions could be equal to zero, and the model-implied 
probabilities are similarly small. Thus a direct comparison of $\hat{\pi}_2$ and $p_2$ has limited utility in
practical data analysis settings involving ordered polytomous IRT models. 
This observation led Cai and Hansen (2013) to apply Joe and Maydeu-Olivares' (2010)
general framework. Instead of examining the two-way probabilities, Cai and Hansen used the
ordinal item scores to compute a raw moment statistic for each item pair:

$$\hat{\mu}_{ij} = \sum_{k=0}^{K_i - 1} \sum_{l=0}^{K_j - 1} k l\, \hat{\pi}_{ij}^{(kl)}, \qquad m_{ij} = \sum_{k=0}^{K_i - 1} \sum_{l=0}^{K_j - 1} k l\, p_{ij}^{(kl)}, \tag{17}$$

where $\hat{\mu}_{ij}$ and $m_{ij}$ are the model-implied and observed second order moments for item pair $(i, j)$.
The moment statistic further collapses the two-way contingency table into a single-number
summary, thereby avoiding the sparseness issue even when the number of categories is large.
The bivariate moments are effectively measuring pairwise correlations between items.


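Again, the second order moments may be computed via reduction operator matrices. The
reduction operator matrix for second order moments can be obtained by pre-multiplying $L_2$ with an
$I(I - 1)/2 \times q_2$ block-diagonal matrix $R_2$. The $I(I - 1)/2$ diagonal blocks of $R_2$ contain
Kronecker products of row vectors that are made up of item category codes: $r_i \otimes r_j$, for
$1 \le j < i \le I$, where $r_i = (1, \ldots, K_i - 1)$ is as defined in Section 4.2. Note that $R_2$ has full row
rank. Thus $L_2^* = R_2 L_2$ has full row rank. An example for the 3-item test is shown below:

$$
\hat{\boldsymbol{\mu}}_2 =
\begin{pmatrix} \hat{\mu}_{21} \\ \hat{\mu}_{31} \\ \hat{\mu}_{32} \end{pmatrix}
= R_2 \hat{\pi}_2 =
\begin{pmatrix} r_2 \otimes r_1 & & \\ & r_3 \otimes r_1 & \\ & & r_3 \otimes r_2 \end{pmatrix} \hat{\pi}_2 =
\begin{pmatrix}
1&2&0&0&0&0&0&0\\
0&0&1&2&0&0&0&0\\
0&0&0&0&1&2&2&4
\end{pmatrix}
\begin{pmatrix}
\hat{\pi}_{21}^{(11)} \\ \hat{\pi}_{21}^{(21)} \\ \hat{\pi}_{31}^{(11)} \\ \hat{\pi}_{31}^{(21)} \\
\hat{\pi}_{32}^{(11)} \\ \hat{\pi}_{32}^{(12)} \\ \hat{\pi}_{32}^{(21)} \\ \hat{\pi}_{32}^{(22)}
\end{pmatrix}
= (R_2 L_2)\hat{\pi} = L_2^* \hat{\pi},
\tag{18}
$$

where $\hat{\boldsymbol{\mu}}_2$ is a vector containing the $I(I - 1)/2$ model-implied second order moments. By
analogy, we define $\mathbf{m}_2 = L_2^* p$ as a vector of observed second order moments.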
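
The Kronecker-product structure of the second order scoring matrix can be checked numerically. The sketch below uses `numpy.kron` on the category-code vectors; the helper name `moment_scoring_matrix` is an assumption of this illustration.

```python
import numpy as np

def moment_scoring_matrix(n_categories):
    """Block-diagonal matrix with blocks r_i (x) r_j for item pairs j < i."""
    codes = [np.arange(1, k) for k in n_categories]  # r_i = (1, ..., K_i - 1)
    blocks = [np.kron(codes[i], codes[j])
              for i in range(1, len(codes)) for j in range(i)]
    scoring = np.zeros((len(blocks), sum(b.size for b in blocks)), dtype=int)
    col = 0
    for row, block in enumerate(blocks):
        scoring[row, col:col + block.size] = block
        col += block.size
    return scoring

R2 = moment_scoring_matrix([2, 3, 3])
print(R2)
```

The last block, $(1, 2, 2, 4)$, lists the products $kl$ for the four retained cells of the item pair $(3, 2)$, in the same order as the corresponding second order margins.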

Existing Test Statistics: $M_2$ and $M_2^*$

Maydeu-Olivares and Joe (2006) proposed the $M_2$ statistic, which utilizes the first and
second order marginal probabilities. Let $L_{12}$ be a $(q_1 + q_2) \times K$ matrix that vertically concatenates
$L_1$ and $L_2$ such that its first $q_1$ rows come from $L_1$ and the remaining rows come from $L_2$. What this
implies is that by pre-multiplying $\hat{\pi}$ and $p$ with $L_{12}$, we can obtain the $(q_1 + q_2) \times 1$ vector of first
and second order marginal residual probabilities $(\hat{\pi}_{12} - p_{12})$ as a linear function of the
multinomial cell residuals $(\hat{\pi} - p)$ defined in Equation (11):

$$
\hat{\pi}_{12} - p_{12} =
\begin{pmatrix} \hat{\pi}_1 - p_1 \\ \hat{\pi}_2 - p_2 \end{pmatrix}
= \begin{pmatrix} L_1 \\ L_2 \end{pmatrix} (\hat{\pi} - p) = L_{12} (\hat{\pi} - p).
\tag{19}
$$

Equation (19) implies that the asymptotic distribution of $(\hat{\pi}_{12} - p_{12})$ is normal:

$$\sqrt{N}(\hat{\pi}_{12} - p_{12}) \xrightarrow{d} \mathcal{N}_{q_1 + q_2}(0, \Xi_{12}), \tag{20}$$

where $\Xi_{12} = L_{12} \Xi_0 L_{12}' = L_{12} \Sigma_0 L_{12}' - L_{12} \Delta_0 \mathcal{F}_0^{-1} \Delta_0' L_{12}' = \Sigma_{12} - \Delta_{12} \mathcal{F}_0^{-1} \Delta_{12}'$. In particular the marginal
Jacobian matrix $\Delta_{12} = L_{12} \Delta_0$ is the matrix of all partial derivatives of the first and second
order marginal probabilities with respect to the parameters:

$$\Delta_{12} = L_{12} \frac{\partial \pi(\gamma_0)}{\partial \gamma'} = \frac{\partial L_{12} \pi(\gamma_0)}{\partial \gamma'} = \frac{\partial \pi_{12}(\gamma_0)}{\partial \gamma'}.$$

The rank of $\Delta_{12}$ determines whether the IRT model is locally identified from the marginal
probabilities. If $\Delta_{12}$ has full column rank, the model is locally identified.

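Let $\hat{\Sigma} = \mathrm{diag}(\hat{\pi}) - \hat{\pi}\hat{\pi}'$ be the multinomial covariance matrix evaluated at the maximum
likelihood solution $\hat{\gamma}$, and let $\hat{\Sigma}_{12} = L_{12} \hat{\Sigma} L_{12}'$. Also evaluate the marginal Jacobian $\Delta_{12}$ at $\hat{\gamma}$:

$$\hat{\Delta}_{12} = \frac{\partial \pi_{12}(\hat{\gamma})}{\partial \gamma'}. \tag{21}$$

By Proposition 4 in Browne (1984), the test statistic

$$M_2 = N (\hat{\pi}_{12} - p_{12})' \hat{\Omega}_{12} (\hat{\pi}_{12} - p_{12}),$$

where $\hat{\Omega}_{12} = \hat{\Sigma}_{12}^{-1} - \hat{\Sigma}_{12}^{-1} \hat{\Delta}_{12} (\hat{\Delta}_{12}' \hat{\Sigma}_{12}^{-1} \hat{\Delta}_{12})^{-1} \hat{\Delta}_{12}' \hat{\Sigma}_{12}^{-1}$, is asymptotically chi-squared with
$q_1 + q_2 - d$ degrees of freedom under the null hypothesis that the model fits exactly in the population.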

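Similarly, let $L_{12}^*$ be an $I(I + 1)/2 \times K$ matrix that vertically concatenates $L_1^*$ and $L_2^*$, such
that its first $I$ rows come from $L_1^*$ and the remaining $I(I - 1)/2$ rows come from $L_2^*$. The
$I(I + 1)/2 \times 1$ vector of first and second order marginal residual moments $(\hat{\boldsymbol{\mu}}_{12} - \mathbf{m}_{12})$ is a
linear function of the multinomial cell residuals $(\hat{\pi} - p)$:

$$
\hat{\boldsymbol{\mu}}_{12} - \mathbf{m}_{12} =
\begin{pmatrix} \hat{\boldsymbol{\mu}}_1 - \mathbf{m}_1 \\ \hat{\boldsymbol{\mu}}_2 - \mathbf{m}_2 \end{pmatrix}
= \begin{pmatrix} L_1^* \\ L_2^* \end{pmatrix} (\hat{\pi} - p) = L_{12}^* (\hat{\pi} - p),
\tag{22}
$$

and the asymptotic distribution of $(\hat{\boldsymbol{\mu}}_{12} - \mathbf{m}_{12})$ is normal:

$$\sqrt{N}(\hat{\boldsymbol{\mu}}_{12} - \mathbf{m}_{12}) \xrightarrow{d} \mathcal{N}_{I(I+1)/2}(0, \Xi_{12}^*), \tag{23}$$

where $\Xi_{12}^* = L_{12}^* \Xi_0 L_{12}^{*\prime} = L_{12}^* \Sigma_0 L_{12}^{*\prime} - L_{12}^* \Delta_0 \mathcal{F}_0^{-1} \Delta_0' L_{12}^{*\prime} = \Sigma_{12}^* - \Delta_{12}^* \mathcal{F}_0^{-1} (\Delta_{12}^*)'$. Let $\hat{\Sigma}_{12}^* = L_{12}^* \hat{\Sigma} L_{12}^{*\prime}$
and evaluate the Jacobian with respect to marginal moments $\Delta_{12}^*$ at $\hat{\gamma}$:

$$\hat{\Delta}_{12}^* = L_{12}^* \frac{\partial \pi(\hat{\gamma})}{\partial \gamma'} = \frac{\partial L_{12}^* \pi(\hat{\gamma})}{\partial \gamma'} = \frac{\partial \boldsymbol{\mu}_{12}(\hat{\gamma})}{\partial \gamma'}.$$

By an analogous argument, the test statistic

$$M_2^* = N (\hat{\boldsymbol{\mu}}_{12} - \mathbf{m}_{12})' \hat{\Omega}_{12}^* (\hat{\boldsymbol{\mu}}_{12} - \mathbf{m}_{12}), \tag{24}$$

where $\hat{\Omega}_{12}^* = (\hat{\Sigma}_{12}^*)^{-1} - (\hat{\Sigma}_{12}^*)^{-1} \hat{\Delta}_{12}^* [(\hat{\Delta}_{12}^*)' (\hat{\Sigma}_{12}^*)^{-1} \hat{\Delta}_{12}^*]^{-1} (\hat{\Delta}_{12}^*)' (\hat{\Sigma}_{12}^*)^{-1}$, is also
asymptotically chi-squared, but with $I(I + 1)/2 - d$ degrees of freedom under the null
hypothesis that the model fits exactly in the population (Cai & Hansen, 2013). Note that
when $I$ is small, the degrees of freedom may become negative for polytomous items.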

The Proposed Test Statistic 

Given the foregoing development, we are ready to introduce the new statistic. Let $q = q_1 + I(I - 1)/2$
denote the total number of first order marginal probabilities and second
order marginal moments. Let $M$ be a $q \times K$ matrix that vertically concatenates $L_1$ and $L_2^*$ such that
its first $q_1$ rows come from $L_1$ and the remaining $I(I - 1)/2$ rows come from $L_2^*$. Let
$\sigma(\hat{\gamma}) = \hat{\sigma} = M\hat{\pi} = M\pi(\hat{\gamma})$ be the $q \times 1$ vector of model-implied first order marginal probabilities
and second order expected marginal moments, and $s = Mp$ be the corresponding observed
proportions and sample moments, i.e.,

$$
\hat{\sigma} = \begin{pmatrix} \hat{\pi}_1 \\ \hat{\boldsymbol{\mu}}_2 \end{pmatrix}
= \begin{pmatrix} L_1 \\ L_2^* \end{pmatrix} \hat{\pi} = M\hat{\pi}, \qquad
s = \begin{pmatrix} p_1 \\ \mathbf{m}_2 \end{pmatrix}
= \begin{pmatrix} L_1 \\ L_2^* \end{pmatrix} p = Mp.
\tag{25}
$$

The $q \times d$ Jacobian matrix $J_0 = J(\gamma_0) = M\Delta_0 = M\Delta(\gamma_0)$ is therefore

$$J_0 = J(\gamma_0) = M \frac{\partial \pi(\gamma_0)}{\partial \gamma'} = \frac{\partial M\pi(\gamma_0)}{\partial \gamma'} = \frac{\partial \sigma(\gamma_0)}{\partial \gamma'}.$$

Note that the number of independent first order marginal probabilities $q_1$ is generally equal to the
number of location/intercept parameters in GR or GPC models. As long as the number of
discrimination parameters does not equal or exceed the number of second order marginal
moments, $q$ is typically larger than $d$ and $J(\gamma)$ may have full column rank, indicating local
identification of the IRT model, in contrast to the case of $M_2^*$.
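
The counting argument can be made explicit. The sketch below tabulates the degrees of freedom of the three statistics for a unidimensional graded model in which every item has the same number of categories; that equal-category restriction and the function name `fit_degrees_of_freedom` are assumptions of this illustration, chosen to match the 5-category examples discussed in the Introduction.

```python
def fit_degrees_of_freedom(n_items, n_categories):
    """Degrees of freedom of M2, M2* and C2 for a unidimensional graded model."""
    I, K = n_items, n_categories
    pairs = I * (I - 1) // 2
    d = I * (K - 1) + I           # intercepts plus one slope per item
    q1 = I * (K - 1)              # independent first order probabilities
    q2 = pairs * (K - 1) ** 2     # independent second order probabilities
    return {
        "M2": q1 + q2 - d,        # uncollapsed margins
        "M2*": I + pairs - d,     # means and second order moments
        "C2": q1 + pairs - d,     # first order margins, second order moments
    }

for n_items in (3, 4, 9, 10):
    print(n_items, fit_degrees_of_freedom(n_items, 5))
```

Because the $q_1$ first order probabilities offset the intercepts one for one, the degrees of freedom of the proposed statistic reduce to $I(I - 1)/2 - I$ in this setting, whatever the number of categories.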

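It is clear then that the $q \times 1$ marginal residual vector $(\hat{\sigma} - s)$ is still a linear function of the
multinomial cell residuals $(\hat{\pi} - p)$ as defined in Equation (11):

$$
\hat{\sigma} - s = \begin{pmatrix} \hat{\pi}_1 - p_1 \\ \hat{\boldsymbol{\mu}}_2 - \mathbf{m}_2 \end{pmatrix}
= \begin{pmatrix} L_1 \\ L_2^* \end{pmatrix} (\hat{\pi} - p) = M(\hat{\pi} - p).
\tag{26}
$$

Equation (26) implies that the asymptotic distribution of $(\hat{\sigma} - s)$ is $q$-variate normal:

$$\sqrt{N}(\hat{\sigma} - s) \xrightarrow{d} \mathcal{N}_q(0, \Phi), \tag{27}$$

where $\Phi = M \Xi_0 M' = M \Sigma_0 M' - M \Delta_0 \mathcal{F}_0^{-1} \Delta_0' M' = M \Sigma_0 M' - J_0 \mathcal{F}_0^{-1} J_0'$.

Let $\Gamma = M \Sigma_0 M'$ and let $\hat{\Gamma} = M \hat{\Sigma} M'$ be an estimate under maximum likelihood estimation.
Define a weight matrix $\hat{\Omega} = \hat{\Gamma}^{-1} - \hat{\Gamma}^{-1} \hat{J} [\hat{J}' \hat{\Gamma}^{-1} \hat{J}]^{-1} \hat{J}' \hat{\Gamma}^{-1}$, where $\hat{J}$ is simply the Jacobian $J(\gamma)$
evaluated at $\hat{\gamma}$. We argue that the new test statistic

$$C_2 = N (\hat{\sigma} - s)' \hat{\Omega} (\hat{\sigma} - s), \tag{28}$$

is asymptotically chi-squared with $q - d$ degrees of freedom under the null hypothesis that the
model fits exactly in the population. Needless to say, $C_2$, $M_2$, and $M_2^*$ become equivalent when
all items are dichotomous.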

To see this is the case, assume that $J(\gamma_0)$ has full column rank and the model is locally
identified. By the continuity of the matrix inverse and the consistency of the maximum
likelihood estimator, the probability limit of $\hat{\Omega}$ is $\Omega = \Upsilon^{-1} - \Upsilon^{-1}J_0[J_0'\Upsilon^{-1}J_0]^{-1}J_0'\Upsilon^{-1}$. Since the
statistic $C_2$ is a quadratic form in an asymptotically normal random vector with zero means, it is
sufficient to show that the product of the limiting covariance matrix and the weight matrix of the
quadratic form, $\Phi\Omega = (\Upsilon - J_0\mathcal{F}_0^{-1}J_0')\Omega = I - J_0[J_0'\Upsilon^{-1}J_0]^{-1}J_0'\Upsilon^{-1}$, is idempotent. This is true.
By Cochran’s theorem and Slutsky’s theorem, $C_2$ is asymptotically chi-squared. The degrees of
freedom are equal to the trace (rank) of $I - J_0[J_0'\Upsilon^{-1}J_0]^{-1}J_0'\Upsilon^{-1}$, which is $q - d$.
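The idempotency and trace claims can be checked in a few lines. The sketch below uses only the definitions of $\Phi$ and $\Omega$ given above and the symmetry of $\Upsilon$; the shorthand $P$ is introduced here for convenience and is not the authors' notation.

```latex
\begin{aligned}
\Omega J_0 &= \Upsilon^{-1}J_0 - \Upsilon^{-1}J_0\,[J_0'\Upsilon^{-1}J_0]^{-1}\,(J_0'\Upsilon^{-1}J_0) = 0,\\
\Phi\Omega &= \Upsilon\Omega - J_0\mathcal{F}_0^{-1}(\Omega J_0)' = \Upsilon\Omega
            = I - J_0[J_0'\Upsilon^{-1}J_0]^{-1}J_0'\Upsilon^{-1} \equiv I - P,\\
P^2 &= J_0[J_0'\Upsilon^{-1}J_0]^{-1}(J_0'\Upsilon^{-1}J_0)[J_0'\Upsilon^{-1}J_0]^{-1}J_0'\Upsilon^{-1} = P,\\
(I-P)^2 &= I - 2P + P^2 = I - P,\\
\operatorname{tr}(I-P) &= q - \operatorname{tr}\!\big([J_0'\Upsilon^{-1}J_0]^{-1}J_0'\Upsilon^{-1}J_0\big) = q - d.
\end{aligned}
```

The transposed product $\Omega\Phi = I - \Upsilon^{-1}J_0[J_0'\Upsilon^{-1}J_0]^{-1}J_0'$ is idempotent by the same argument and has the same trace, so the order of the two factors does not affect the conclusion.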

A Measure of Model Error 

When the model does not fit exactly in the population, there does not exist a $\gamma_0$ such that
$\pi(\gamma_0) = \pi_0$. In general, for any parameter vector $\gamma$, $\pi_{12}(\gamma) \neq L\pi_0$, $\kappa_{12}(\gamma) \neq L^*\pi_0$, and
$\sigma(\gamma) \neq M\pi_0$, unless the misspecification only affects third-order margins or above. The limiting
means of the random vectors in Equations (20), (23), and (27) are generally no longer zero, and
$M_2$, $M_2^*$, and $C_2$ are no longer distributed as central chi-square random variables.
Maydeu-Olivares (2013) suggested that we borrow from the model fit assessment literature in structural
equation modeling, and utilize the quadratic forms in $M_2$, $M_2^*$, and $C_2$ to compute Root Mean
Square Error of Approximation (RMSEA; Browne & Cudeck, 1993; Maydeu-Olivares, 2013)
type indices to characterize the per degree of freedom error of approximation in the population.

Generically, for an observed discrepancy measure $\hat{F}$, an unbiased estimate of the population
discrepancy is $\hat{F}_0 = \hat{F} - df/N$ (Browne & Cudeck, 1993), where $df$ is the degrees of freedom
available for testing. The sample RMSEA estimate is defined (with truncation at 0) as

$$\hat{\varepsilon} = \sqrt{\max\left(\frac{\hat{F}_0}{df},\, 0\right)}. \qquad (29)$$


Confidence intervals of RMSEA may be easily computed from the noncentral chi-square 
distribution by following established procedures in Browne and Cudeck (1993). 
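
The point estimate in Equation (29) is simple to compute once a test statistic, its degrees of freedom, and the sample size are known. The Python sketch below assumes the observed discrepancy is the statistic divided by $N$ (as Equation (28) implies for $C_2$); the function name `rmsea` is hypothetical, and confidence intervals are not computed here. The example inputs are the $C_2$ and $M_2$ values reported in Table 6 of this report.

```python
import math


def rmsea(stat, df, n):
    """Sample RMSEA from a chi-square-type statistic, truncated at zero."""
    if df <= 0 or n <= 0:
        raise ValueError("df and n must be positive")
    f_hat = stat / n                       # observed discrepancy, F-hat = T / N
    f0_hat = f_hat - df / n                # bias-corrected discrepancy estimate
    return math.sqrt(max(f0_hat / df, 0.0))


# Values taken from Table 6 of this report (N = 1000).
print(round(rmsea(13.05, 2, 1000), 3))    # C2: 0.074
print(round(rmsea(245.60, 92, 1000), 3))  # M2: 0.041
print(rmsea(1.5, 2, 1000))                # statistic below its df: 0.0
```

The last call shows the truncation: whenever the statistic is smaller than its degrees of freedom, the estimate is set to zero rather than taking the square root of a negative number.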

Let $\hat{F}_M = (\hat{\pi}_{12} - p_{12})'\hat{\Omega}_{12}(\hat{\pi}_{12} - p_{12})$ be the observed discrepancy measure based
on the $M_2$ statistic, $\hat{F}_M^* = (\hat{\kappa}_{12} - m_{12})'\hat{\Omega}_{12}^*(\hat{\kappa}_{12} - m_{12})$ the observed discrepancy measure
based on the $M_2^*$ statistic, and finally let $\hat{F}_C = (\hat{\sigma} - s)'\hat{\Omega}(\hat{\sigma} - s)$ be the observed
discrepancy measure based on the $C_2$ statistic. Then we can define $\hat{\varepsilon}_M$, $\hat{\varepsilon}_M^*$, and $\hat{\varepsilon}_C$ as
three different versions of the RMSEA index, each from a different underlying test
statistic. To the extent that the test statistics have different behavior under the
alternative hypothesis, the different RMSEAs will exhibit differences in magnitude. The
variation is important to understand because the conclusions drawn from the RMSEA
values may be quite different, depending on which version of the limited-information
test statistic one has chosen to evaluate the fit of the IRT model.

Simulations 

A small simulation study was conducted to examine the calibration and power of the
proposed statistic, $C_2$. Along with $C_2$, the $M_2$ statistic of Maydeu-Olivares and Joe (2006) and the
fully collapsed $M_2^*$ statistic of Cai and Hansen (2013) were considered. In all conditions, a
sample size of $N = 500$ was used. The data were generated using Samejima’s (1969) GR model,
with $K_i = 4$ ordered response categories per item.

All generating parameter values are presented in Table 2. For the null condition, the
generating model was unidimensional, with $I = 4$, 6, or 8, adding successively more items from
Table 2 to the generating model. The $\beta_{i1}$ column shows the slopes of the unidimensional GR
model. As mentioned above, a shortcoming of the $M_2^*$ statistic is that it cannot be used for
smaller models with relatively large $K_i$, due to lack of local identification and negative degrees
of freedom. Such is the case here, as $M_2^*$ cannot be computed except for the $I = 8$ condition. On
the other hand, both $C_2$ and $M_2$ can be computed for all the $I$ considered here. However, since the
items are polytomous, it is possible that the distribution of $M_2$ will be distorted because of poorly
filled second order marginal tables, as demonstrated by Cai and Hansen (2013), leading to a
reduction of power. To study power, model misspecification was introduced through the
presence of a second latent variable, $\theta_2 \sim \mathcal{N}(0, 1)$, uncorrelated with $\theta_1$, but influencing a
doublet of items. More specifically, for all non-null conditions, data for items 1 and 2 were
generated using a two-dimensional GR model, wherein the cumulative category probabilities are
defined as

$$T_i^+(k \mid \theta_1, \theta_2) = \frac{1}{1 + \exp[-(\alpha_{ik} + \beta_{i1}\theta_1 + \beta_{i2}\theta_2)]}. \qquad (30)$$


The additional slopes on $\theta_2$ were set to 0.8, as shown in Table 2. The number of items for the
non-null condition was also $I = 4$, 6, or 8. The fitted model in all conditions was the
unidimensional GR model. There were 1,000 replications under each of the 6 conditions.
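
The data-generating step described above can be sketched in plain Python. This is an illustrative reconstruction, not the authors' simulation code: the function names are hypothetical, the parameter values are those listed in Table 2, and the first latent variable is assumed to be standard normal (the text states this only for the second one).

```python
import math
import random

# Table 2 generating values: intercepts (alpha_i1..alpha_i3), slope on theta_1,
# and slope on theta_2 (0.8 for items 1 and 2 in non-null conditions, else 0).
INTERCEPTS = [(2.0, 0.5, -1.0)] * 4 + [(1.0, -0.5, -2.0)] * 4
SLOPES_1 = [1.5, 1.7, 1.9, 2.1] * 2
SLOPES_2 = [0.8, 0.8] + [0.0] * 6


def cumulative_probs(item, theta1, theta2, misspecified):
    """P(X_i >= k | theta) for k = 1, 2, 3, following Equation (30)."""
    b2 = SLOPES_2[item] if misspecified else 0.0
    lin = SLOPES_1[item] * theta1 + b2 * theta2
    return [1.0 / (1.0 + math.exp(-(a + lin))) for a in INTERCEPTS[item]]


def simulate(n_persons, n_items, misspecified, seed=0):
    """Draw an n_persons x n_items matrix of responses coded 0..3."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_persons):
        t1, t2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        row = []
        for i in range(n_items):
            u = rng.random()
            # Cumulative probabilities decrease in k, so counting the ones
            # that exceed u gives a response with P(X >= k) equal to them.
            row.append(sum(u < p for p in cumulative_probs(i, t1, t2, misspecified)))
        data.append(row)
    return data


data = simulate(500, 4, misspecified=True)
print(len(data), len(data[0]))
```

Setting `misspecified=False` drops the second slope and yields data from the unidimensional null model; fitting the unidimensional GR model to either data set would require separate estimation software.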


Table 2

Generating Parameters for Simulation Study

Item i   $\alpha_{i1}$   $\alpha_{i2}$   $\alpha_{i3}$   $\beta_{i1}$   $\beta_{i2}$
1        2.0             0.5             -1.0            1.5            0.8
2        2.0             0.5             -1.0            1.7            0.8
3        2.0             0.5             -1.0            1.9
4        2.0             0.5             -1.0            2.1
5        1.0             -0.5            -2.0            1.5
6        1.0             -0.5            -2.0            1.7
7        1.0             -0.5            -2.0            1.9
8        1.0             -0.5            -2.0            2.1

Type I Error Rate 

Table 3 displays means, variances, and empirical rejection rates for the 3 null conditions.
Also, p-values from two-sided Kolmogorov-Smirnov (KS) tests are reported to detect any
departures from the reference chi-square distributions. Immediately apparent is the range of
degrees of freedom for the test statistics. While $M_2$ has 50 degrees of freedom with just 4 items,
$M_2^*$ has barely positive degrees of freedom ($df = 4$) with as many as 8 items. $C_2$, by construction,
lies somewhere in between. All statistics behave well. The observed mean and variance
relationships closely track those of a central chi-square variable, with variance approximately
equal to twice the mean. The non-significant KS p-values and observed rejection rates support
the proposition that all of the statistics are well-calibrated under the null. Overall, the results
support the theoretical development claiming that $C_2$ is approximately chi-square distributed. As
a consequence, the power of $M_2$, $C_2$, and $M_2^*$ can be more directly compared.

Table 3

Simulation Results: Null Conditions

I   Statistic   First order   Second order   d    df    Mean     Var      Rejection rates at $\alpha$   KS
                information   information                                 .010      .050      .100
8   $M_2$       24            252            32   244   244.36   463.74   .014      .037      .093      .503
    $C_2$       24            28             32   20    19.86    36.19    .005      .046      .084      .671
    $M_2^*$     8             28             32   4     4.00     8.61     .014      .050      .105      .790
6   $M_2$       18            135            24   129   128.70   237.48   .008      .041      .100      .215
    $C_2$       18            15             24   9     8.90     16.64    .012      .035      .089      .692
    $M_2^*$     6             15             24   -3
4   $M_2$       12            54             16   50    50.27    102.40   .017      .052      .105      .519
    $C_2$       12            6              16   2     2.03     4.71     .013      .053      .110      .182
    $M_2^*$     4             6              16   -6

Note. For $M_2$ and $C_2$, first order information refers to the total number of independent first order marginal
probabilities, and for $M_2^*$, first order information comes in the form of item-specific marginal means. For $M_2$, second
order information refers to the number of independent second order marginal probabilities, whereas for $C_2$ and $M_2^*$,
second order information are bivariate product moments. Note that for two conditions, $M_2^*$ cannot be computed
because of negative degrees of freedom.
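
The information counts and degrees of freedom in Table 3 follow from simple counting rules. The Python sketch below reproduces them under the assumptions that every item has the same number of categories and that a unidimensional GR model with one slope per item is fitted; the function name `limited_info_df` is hypothetical.

```python
from math import comb


def limited_info_df(n_items, n_cats):
    """First/second order information, d, and df for M2, C2, and M2* (GR model)."""
    pairs = comb(n_items, 2)
    d = n_items * n_cats                      # (K - 1) intercepts + 1 slope per item
    first = {"M2": n_items * (n_cats - 1),    # independent marginal probabilities
             "C2": n_items * (n_cats - 1),
             "M2*": n_items}                  # one mean per item
    second = {"M2": pairs * (n_cats - 1) ** 2,  # independent bivariate probabilities
              "C2": pairs,                      # one product moment per item pair
              "M2*": pairs}
    return {name: (first[name], second[name], d, first[name] + second[name] - d)
            for name in first}


for n_items in (8, 6, 4):
    print(n_items, limited_info_df(n_items, 4))
```

Each tuple is (first order information, second order information, $d$, $df$). A negative last entry signals that the corresponding statistic cannot be computed, which is what happens for the fully collapsed statistic with 4 or 6 four-category items.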

Power 

Empirical rejection rates for $M_2$, $C_2$, and $M_2^*$ under the non-null conditions are
presented in Table 4. Overall, the rejection rates increase as the number of items
decreases from $I = 8$ to 6 to 4. This is expected, as the misspecification only affects the
first two items regardless of $I$. Consequently, the misspecification is more severe with a
smaller number of items. Comparing the three statistics, $C_2$ is clearly the most powerful,
for all conditions. As one example, consider the rejection rates at $\alpha = .05$ for $I = 8$. The
rejection rate of $C_2$ (.335) is nearly triple that of $M_2$ (.119), while the rejection rate of $M_2^*$
(.052) barely exceeds the nominal $\alpha$ level.

Table 4

Simulation Results: Power and RMSEA

I   Statistic   First order   Second order   d    df    Rejection rates at $\alpha$   RMSEA
                information   information               .010      .050      .100      $F_0$   $\varepsilon_0$   Mean   90% CI
8   $M_2$       24            252            32   244   .027      .119      .196      .017    .008              .008   (0, .020)
    $C_2$       24            28             32   20    .125      .335      .457      .016    .029              .025   (0, .048)
    $M_2^*$     8             28             32   4     .011      .052      .112      <.001   .001              .015   (0, .053)
6   $M_2$       18            135            24   129   .034      .124      .212      .015    .011              .010   (0, .024)
    $C_2$       18            15             24   9     .188      .386      .506      .014    .040              .035   (0, .067)
    $M_2^*$     6             15             24   -3
4   $M_2$       12            54             16   50    .043      .146      .237      .011    .015              .013   (0, .032)
    $C_2$       12            6              16   2     .278      .504      .603      .010    .072              .061   (0, .123)
    $M_2^*$     4             6              16   -6

Note. For $M_2$ and $C_2$, first order information refers to the total number of independent first order marginal
probabilities, and for $M_2^*$, first order information comes in the form of item-specific marginal means. For $M_2$, second
order information refers to the number of independent second order marginal probabilities, whereas for $C_2$ and $M_2^*$,
second order information are bivariate product moments. Note that for two conditions, $M_2^*$ cannot be computed
because of negative degrees of freedom.

Table 4 also shows the sample mean and empirical 90% confidence intervals for $\hat{\varepsilon}_M$, $\hat{\varepsilon}_M^*$,
and $\hat{\varepsilon}_C$. Interestingly, for a given condition, the means vary considerably depending on which
statistic is used to compute the sample RMSEA. For instance, for the $I = 4$ condition, the mean
of $\hat{\varepsilon}_M$ is .013, while the mean of $\hat{\varepsilon}_C$ is .061. Under commonly used guidelines, the former would
indicate “excellent” fit, while the latter would indicate merely “acceptable” fit.

Some insight into this phenomenon can be gained by computing the population RMSEA
values. For purposes of illustration, consider the $I = 4$ condition. Under the alternative model,
the $K = 4^4 = 256$ population multinomial probabilities may be computed and collected in $\pi_0$.
Then, treating $\pi_0$ as $p$, that is, treating the population probabilities as the sample multinomial
proportions, Equation (7) may be maximized under the null model to yield $\tilde{\gamma}$ and $\tilde{\pi} = \pi(\tilde{\gamma}) \neq \pi_0$.
Using $\pi_0$ and $\tilde{\pi}$, three different population discrepancy measures may be computed, based
on $M_2$, $C_2$, and $M_2^*$. These population discrepancy measures, in turn, may be used to find
corresponding population RMSEA values.

The population discrepancy measures and population RMSEA values are shown in Table 4,
in the columns labeled $F_0$ and $\varepsilon_0$, respectively. Generally, for any condition and statistic, the
mean of the sample RMSEA values is quite similar to its population RMSEA value, indicating
consistency of the sample estimation. For instance, for the case of $I = 4$, for $M_2$ we see that
$\varepsilon_0 = .015$, while the sample estimate is .013. Similarly, for $C_2$, $\varepsilon_0 = .072$, while the sample
estimate is .061. Note, however, that the two population RMSEA values may lead to different
evaluations of fit, given common interpretive guidelines. This incongruity underscores the need
to consider the underlying statistics in interpreting RMSEA. In other words, each test statistic
provides a different measure of the same model misspecification.
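
As a worked check, assume the standard Browne and Cudeck (1993) definition of the population RMSEA as the square root of the population discrepancy per degree of freedom (the formula is not restated in this section). The Table 4 entries for the four-item condition, with 50 and 2 degrees of freedom, can then be approximately reproduced from the rounded population discrepancy values:

```latex
\varepsilon_0 = \sqrt{F_0/df}: \qquad
\sqrt{.011/50} \approx .015 \;\; (M_2), \qquad
\sqrt{.010/2} \approx .071 \;\; (C_2).
```

The small gap between .071 and the tabled .072 is what one would expect from the discrepancy being reported to only three decimals. The two discrepancies are nearly equal, so the difference between the two population RMSEA values comes almost entirely from the degrees of freedom in the denominator.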

Analysis of Empirical Data 

The empirical data (a random sample of $N = 1000$) come from the PROMIS® Smoking
Initiative (Edelen, Tucker, Shadel, Stucky, & Cai, 2012). One task of this initiative is to develop
and evaluate short-form measures of cigarette smoking related psycho-bio-social constructs,
which are more practical to administer in clinical and research settings. In this process, the
research team considered short forms with as few as 4 items (Hansen et al., in press). As an
illustration of the application of the $C_2$ statistic, we analyze a 4-item subset. The item stems,
presented in Table 5, all pertain to perceived positive benefits of cigarette smoking. The response 
scale elicits respondents’ degree of agreement to the statements presented in the item stems with 
5 ordered categories: 0 = Not at All, 1 = A Little Bit, 2 = Somewhat, 3 = Quite a Bit, 4 = Very 
Much. 


Table 5 

Item Stems for the Four Smoking Items 

51 Smoking helps me concentrate. 

52 Smoking makes me feel better in social situations. 

53 If I’m feeling irritable, a cigarette will help me relax. 

54 Smoking a cigarette energizes me. 


A unidimensional GR model was fit to all items, and several overall tests were calculated.
The statistics, associated probabilities, and RMSEA estimates are shown in Table 6. $M_2^*$ cannot
be computed because there are no degrees of freedom left for model fit testing. The null
hypothesis of exact fit is rejected by all of the statistics except $G^2$ ($p = .277$). However, even
with just $I = 4$ items, the number of response patterns is $K = 5^4 = 625$. And due to the
covariation among the item responses, not all of the response patterns are observed in the sample
data. Given the sparseness and prior research on the behavior of the full-information statistics,
we should be skeptical that $G^2$ and $X^2$ actually follow their purported distributions.

Table 6

Model Fit Statistics for the Four Smoking Items

Statistic   Value    df    p       $\hat{\varepsilon}$   90% CI
$G^2$       624.17   604   .277    .006                  (.001, .012)
$X^2$       933.63   604   <.001   .023                  (.020, .026)
$M_2$       245.60   92    <.001   .041                  (.035, .047)
$C_2$       13.05    2     .002    .074                  (.040, .115)


Turning to the limited information statistics, $M_2$ and $C_2$, several interesting observations
can be made. First, using guidelines developed in the context of linear factor analysis and
structural equation modeling for continuous data (e.g., Browne & Cudeck, 1993), assessment of
model fit depends on whether $\hat{\varepsilon}_M$ or $\hat{\varepsilon}_C$ is used. Second, the relative magnitudes of $\hat{\varepsilon}_M$ and $\hat{\varepsilon}_C$ are
consistent with the simulation study results, where the means of the $C_2$-based RMSEA estimates
were consistently greater than those of the RMSEA estimates based on $M_2$. Again, it is apparent
that the underlying statistic matters when interpreting RMSEAs.


Table 7

Marginal Frequencies for Item Pair (1,3) from the Empirical Example

Observed (Model-Implied)

Item 3                     Item 1 Category Code                   Marginal Frequency
Category Code    0         1         2         3         4         for Item 3
0                93        8         5         0         0         106
                 (87.4)    (14.4)    (5.1)     (1.2)     (0.4)     (108.6)
1                129       68        20        3         4         224
                 (138.0)   (55.4)    (25.1)    (6.7)     (2.5)     (227.7)
2                79        79        63        15        7         243
                 (90.1)    (74.4)    (49.2)    (16.9)    (7.2)     (237.7)
3                56        61        76        40        13        246
                 (48.1)    (67.4)    (70.5)    (35.7)    (21.0)    (242.7)
4                22        32        31        37        59        181
                 (14.6)    (29.3)    (48.2)    (40.8)    (50.3)    (183.2)
Marginal Frequency
for Item 1       379       248       195       95        83        1000
                 (378.2)   (240.9)   (198.1)   (101.4)   (81.4)    (1000)

Finally, for the given data set, there is reason to suspect that the $M_2$ statistic may not
perform well. A number of the sample second order marginal tables are poorly filled (see, e.g.,
Table 7), which might reduce the power of $M_2$ against model misspecification (Cai & Hansen,
2013). On the other hand, there is no similar concern for $C_2$, since it is based on a further
collapsing of the second order marginal tables.
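
To make the contrast concrete, the Python sketch below takes the Table 7 frequencies, counts the sparsely filled cells, and collapses each 5 x 5 table into a single bivariate product moment. Scoring the categories 0 through 4 and using 5 as the sparseness cutoff are assumptions made for illustration, and the function names are hypothetical.

```python
# Observed and model-implied frequencies for item pair (1, 3) from Table 7.
# Rows: item 3 category 0..4; columns: item 1 category 0..4.
OBSERVED = [
    [93, 8, 5, 0, 0],
    [129, 68, 20, 3, 4],
    [79, 79, 63, 15, 7],
    [56, 61, 76, 40, 13],
    [22, 32, 31, 37, 59],
]
EXPECTED = [
    [87.4, 14.4, 5.1, 1.2, 0.4],
    [138.0, 55.4, 25.1, 6.7, 2.5],
    [90.1, 74.4, 49.2, 16.9, 7.2],
    [48.1, 67.4, 70.5, 35.7, 21.0],
    [14.6, 29.3, 48.2, 40.8, 50.3],
]


def product_moment(table):
    """Collapse a two-way frequency table into the mean of the score product."""
    total = sum(sum(row) for row in table)
    weighted = sum(r * c * n for r, row in enumerate(table) for c, n in enumerate(row))
    return weighted / total


def sparse_cells(table, threshold=5):
    """Count cells whose frequency falls below an (arbitrary) threshold."""
    return sum(n < threshold for row in table for n in row)


print(sparse_cells(OBSERVED), sparse_cells(EXPECTED))
print(round(product_moment(OBSERVED), 3), round(product_moment(EXPECTED), 3))
```

The cell-level comparison involves several near-empty cells, whereas the collapsed comparison reduces the whole table to one observed and one model-implied number, each of which draws on all 1000 respondents.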

Discussion 

Motivated by Maydeu-Olivares and Joe’s (2006) seminal work, the limited-information test
statistic $M_2$ has become an important new tool in formal evaluations of IRT model fit. $M_2$ relies
on a comparison between the observed and expected first order and second order marginal
probabilities. The formalism to establish asymptotic chi-squaredness of $M_2$ involves reduction
operator matrices. Building on Joe and Maydeu-Olivares’ (2010) important insight that test
statistics could be formed from linear functions of the first and second order marginal residuals,
Cai and Hansen (2013) proposed a limited-information test statistic $M_2^*$ for polytomous IRT
models that utilizes a comparison between observed and expected item means and second order
moments, which are further reductions of the marginal probabilities. They show that in certain
conditions $M_2^*$ can be more powerful than $M_2$ because some of the second order marginal
probabilities can become sparse in $M_2$.

In this research, we propose a hybrid statistic $C_2$ that compares the observed and expected
first order marginal probabilities in unreduced form, but further collapses the second order
marginal probabilities into observed and expected moments. This new statistic circumvents a
limitation of $M_2^*$, namely, that the number of items required to compute the statistic depends on
the number of categories per item. This is because $M_2^*$ collapses several first order marginal
probabilities into a single number for each item, which can render the model not locally
identified from the item means and second order moments. On the other hand, the ability to
compute $C_2$ does not depend on the number of categories per item. Also, $C_2$ is potentially more
powerful than $M_2$ because it has none of the sparseness issues associated with $M_2$. We
demonstrate the effectiveness of $C_2$ with simulation studies and empirical data analysis. We also
make the observation that to the extent approximate model fit evaluation is desirable via the use
of RMSEA indices (e.g., as advocated by Maydeu-Olivares, 2013), it is important to keep in
mind the statistical properties of the underlying test statistics. Sample discrepancy measures
based on different limited-information statistics, $M_2$, $M_2^*$, $C_2$, or full-information statistics $G^2$ and
$X^2$, may paint different pictures of the degree of model error because they “estimate” different
population RMSEAs.

There are obvious future directions with this line of research. We have chosen to focus on
IRT models for ordinal data, completely bypassing nominal categories models. The development
presented here is also limited to unidimensional IRT models. It would be desirable to implement
and study a version of $C_2$ for hierarchical multidimensional IRT models, but it would probably
require technical devices similar to those employed in Cai and Hansen (2013). Also of interest is
the extension of these statistics to the case of IRT models that do not have continuous underlying
latent traits. Finally, new step-down model error diagnostics would have to be developed to
locate the source of misspecification and to explain the rejection of the overall goodness of fit
hypothesis. There is reason to be excited about these possibilities.